@fawazshah fawazshah commented Apr 22, 2021

This PR adds data checkpointing and extra error handling, and improves code readability.

  • We now catch errors when calling newspaper.build
  • We increment a new variable error_count whenever we encounter an error while downloading or parsing an article, or a NoneType publish date. If error_count > 10 we skip to the next article. (Previously we skipped only after encountering 10 or more NoneType dates.)
  • We remove the unneeded count function parameter
  • We print the current news site's position out of the total number of sites being scraped (e.g. "NEWS SITE 3 OUT OF 99")
  • We now save scraped data to JSON after each news site is processed rather than at the very end of processing, meaning if the script gets interrupted any data collected so far is saved
  • We remove the default limit parameter in run so it doesn't override the user-inputted limit
